Abstract. We tackle the problem of multi-class relational sequence learning
using relevant patterns discovered from a set of labelled sequences. To deal
with this problem, each relational sequence is first mapped into a feature
vector using the result of a feature construction method. Since the efficacy
of sequence learning algorithms strongly depends on the features used to
represent the sequences, the second step is to find an optimal subset of the
constructed features leading to high classification accuracy. This feature
selection task is solved by adopting a wrapper approach that uses a stochastic
local search algorithm embedding a naive Bayes classifier. The performance of
the proposed method applied to a real-world dataset shows an improvement when
compared to other established methods, such as hidden Markov models, Fisher
kernels and conditional random fields for relational sequences.

Key words: Relational Sequence Learning, Feature Construction/Selection, 
Stochastic Local Search, Statistical Relational Learning. 



1 Introduction 

Sequential reasoning is a fundamental task of intelligence. Indeed, sequential
data may be found in many contexts of everyday life. From a computer science
point of view, sequential data arise in many applications such as video
understanding, planning, computational biology, user modelling and speech
recognition. Sequences are the simplest form of structured patterns, and
different methodologies have been proposed to face the problem of sequential
pattern mining, first introduced in [1], with the aim of capturing the maximal
frequent sequences existing in a given database. One of the many problems
investigated concerns assigning labels to sequences of objects. However, some
environments involve very complex components and features. Thus, the classical
data mining approaches, which look for patterns in a single data table, have
been extended to multi-relational data mining approaches, which look for
patterns involving multiple tables (relations) of a relational database. This
has led to the exploitation of a more powerful knowledge representation
formalism, such as first-order logic.

Indeed, sequential learning techniques may be classified by the language they
adopt to describe sequences. On the one hand we find algorithms adopting a
propositional language, such as hidden Markov models (HMMs) [2], allowing both
a simple model representation and an efficient algorithm; on the other hand,
probabilistic relational systems are able to elegantly handle complex and
structured descriptions for which an atomic representation could make the
problem intractable for propositional sequence learning techniques. The aim of
this paper is to propose a new probabilistic algorithm for relational sequence
learning [3].

One way to tackle the task of learning discriminant functions in relational
learning is to reformulate the problem into an attribute-value form and then
apply a propositional learner [4]. The reformulation may be obtained by
adopting a feature construction method, such as mining frequent patterns that
can then be used as new Boolean features [5-7]. Since the efficacy of learning
algorithms strongly depends on the features used to represent the sequences, a
feature selection step is very useful. The aim of feature selection is to find
an optimal subset of the input features leading to high classification
performance or, more generally, to carry out the classification task in an
optimal way. However, the search for a variable subset is an NP-hard problem.
Therefore, the optimal solution cannot be guaranteed unless an exhaustive
search of the solution space is performed. The use of a stochastic local search
procedure allows one to obtain good solutions without having to explore the
whole solution space. Algorithms for feature selection can be divided into two
categories: wrapper and filter methods [8]. When the feature selection
algorithm embeds a classifier and selects subsets of features guided by their
predictive power as estimated by the classifier, it is using a wrapper
approach. The filter approach selects the features in a preprocessing step,
using heuristics based on the intrinsic characteristics of the data and
ignoring the learner.

In this paper we propose a new algorithm for relational sequence learning,
named Lynx, that in its first step adopts a classical feature construction
approach. As we will see in the following, the features are not treated as
Boolean here; instead, we are able to associate a probability to each
constructed feature. In the second step, the system adopts a wrapper feature
selection approach that uses a stochastic local search (non-exhaustive)
procedure, embedding a naive Bayes classifier, to select an optimal subset of
the constructed features. In particular, the optimal subset of patterns is
searched for using a Greedy Randomised Adaptive Search Procedure (GRASP), and
the search is guided by the predictive power of the selected subset computed
using a naive Bayes approach.

Hence the focus of the paper is on combining probabilistic feature
construction and feature selection for relational sequence learning. The aim is
to show that the proposed approach is comparable to other purposely designed
probabilistic approaches for relational sequence learning.

The outline of the paper is as follows. After discussing related work in
Section 2, we present the Lynx algorithm in Section 3. In particular, we
briefly present the description language, followed by the proposed feature
construction and feature selection methods. Before concluding the paper in
Section 5, we experimentally evaluate Lynx on a real-world dataset in
Section 4.

2 Related Work 

As already pointed out, the problem of sequential pattern mining is central to
many data mining applications, and many efforts have been made to propose
purposely designed methods to face it. Most of the work has been restricted to
propositional patterns, that is, patterns not involving first-order predicates.
One of the early domains that highlighted the need to describe sequences with
structural information was bioinformatics. The need to represent many
real-world domains with structured data sequences has since become more
pressing, and consequently many efforts have been made to extend existing
methods, or to propose new ones, to manage sequential patterns in which
first-order predicates are involved. Related work may be divided into two
categories. The first category contains work belonging to the Inductive Logic
Programming (ILP) area [9], which reformulates the initial relational problem
into an attribute-value form by using frequent patterns as new Boolean
features, and then applies propositional learners. The second category contains
all the systems purposely designed to tackle the problem of relational sequence
analysis, falling into the more specific Statistical Relational Learning area
[10], where probabilistic models are combined with relational learning.

This work is related to that in [7], where the authors presented one of the
first Inductive Logic Programming feature construction methods. They first
construct a set of features, adopting a declarative language to constrain the
search space and to find discriminant features. Then, these features are used
to learn a classification model with a propositional learner.

In [11] the authors present a logic language, SeqLog, for mining sequences of
logical atoms, and the inductive mining system MineSeqLog, which combines
principles of the level-wise search algorithm with version spaces in order to
find all patterns that satisfy a constraint, by using an optimal refinement
operator for SeqLog. SeqLog is a logic representational framework that adopts
two operators to represent sequences: one to indicate that an atom is the
direct successor of another, and the other to say that an atom occurs somewhere
after another. Furthermore, based on this language, the notions of subsumption
and entailment and a fixpoint semantics are given.

Although these works are related to ours, they take into account the feature
construction problem only. Here, instead, we combine a feature construction
process with a feature selection algorithm maximising the predictive accuracy
of a probabilistic model. Systems very similar to our approach are those that
combine a probabilistic model with a relational description, such as logical
hidden Markov models (LoHMMs) [12], Fisher kernels for logical sequences [13],
and relational conditional random fields [14], which are purposely designed for
relational sequence learning.

In [13] an extension of classical Fisher kernels, which work on sequences over
flat alphabets, has been proposed in order to make them able to model logical
sequences, i.e., sequences over an alphabet of logical atoms. Fisher kernels
were developed to combine generative models with kernel methods, and have
shown promising results for the combination of support vector machines with
(logical) hidden Markov models and Bayesian networks. Subsequently, in [12] the
same authors proposed an algorithm for selecting LoHMMs from data. HMMs [2] are
one of the most popular methods for analysing sequential data, but they can
only handle sequences of flat, unstructured symbols. The proposed logical
extension [15] overcomes this weakness by handling sequences of structured
symbols by means of a probabilistic ILP framework.

Finally, in [14] an extension of conditional random fields (CRFs) to logical
sequences has been proposed. For sequence labelling tasks, CRFs are a better
alternative to HMMs, since they make it relatively easy to model arbitrary
dependencies in the input space. CRFs are undirected graphical models that,
instead of learning a generative model as HMMs do, learn a discriminative model
designed to handle non-independent input features. In [14], the authors lifted
CRFs to the relational case by representing the potential functions as a sum of
relational regression trees learnt by a relational regression tree learner.

3 Lynx: a relational pattern-based classifier 

This section first briefly reports the framework for mining
(multi-dimensional) relational sequences introduced in [16] to manage patterns
in which more than one dimension is taken into account. That framework has been
used in Lynx because of its general logic formalism for representing and mining
relational sequences. On top of that framework Lynx implements a probabilistic
pattern-based classifier. In particular, after introducing the representation
language, we present the Lynx system along with its feature construction
capability, the adopted pattern-based classification model, and the feature
selection approach.

3.1 The language 

As a representation language we used first-order logic, which we briefly
review. The first-order alphabet consists of a set of constants, a set of
variables, a set of function symbols, and a non-empty set of predicate symbols.
Each function symbol and each predicate symbol has a natural number (its arity)
assigned to it. A term is a constant symbol, a variable symbol, or an n-ary
function symbol $f$ applied to n terms $t_1, t_2, \ldots, t_n$.

An atom $p(t_1, \ldots, t_n)$ (or atomic formula) is a predicate symbol $p$ of
arity n applied to n terms $t_i$. Both $l$ and its negation $\overline{l}$ are
said to be literals (resp. positive and negative literal) whenever $l$ is an
atomic formula.

A clause is a formula of the form
$\forall X_1 \ldots \forall X_n (L_1 \vee \ldots \vee L_i \vee \overline{L_{i+1}} \vee \ldots \vee \overline{L_m})$
where each $L_j$ is a literal and $X_j$, $j = 1, \ldots, n$, are all the
variables occurring in the literals. The same clause may be written as
$L_1, \ldots, L_i \leftarrow L_{i+1}, \ldots, L_m$.

Clauses, literals and terms are said to be ground whenever they do not contain
variables. A Datalog clause is a clause with no function symbols of non-zero
arity; only variables and constants can be used as predicate arguments.

A substitution $\theta$ is defined as a set of bindings
$\{X_1 \leftarrow a_1, \ldots, X_n \leftarrow a_n\}$ where $X_i$,
$1 \le i \le n$, is a variable and $a_i$, $1 \le i \le n$, is a term. A
substitution $\theta$ is applicable to an expression $e$, obtaining the
expression $e\theta$, by replacing all variables $X_i$ with their corresponding
terms $a_i$.
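
As an illustration of how a substitution is applied to a Datalog atom, the
following minimal Python sketch may help. The tuple encoding of atoms, the
helper names `is_variable` and `apply_substitution`, and the convention that
variables start with an upper-case letter are assumptions made for this example
only; they are not taken from the Lynx implementation.

```python
def is_variable(term: str) -> bool:
    """Assumed convention: variables start with an upper-case letter."""
    return term[:1].isupper()


def apply_substitution(atom: tuple, theta: dict) -> tuple:
    """Return the atom with every bound variable replaced by its term."""
    predicate, *args = atom
    return (predicate, *(theta.get(t, t) if is_variable(t) else t for t in args))


theta = {"X": "e1", "Y": "e2"}
print(apply_substitution(("next", "X", "Y"), theta))   # ('next', 'e1', 'e2')
print(apply_substitution(("helix", "X", "Z"), theta))  # ('helix', 'e1', 'Z')
```

Variables that are not bound by the substitution (here `Z`) are left untouched,
and constants are never replaced.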

Lynx includes the multi-dimensional relational framework, and the
corresponding pattern mining algorithm, reported in [16], which we briefly
recall here.

A 1-dimensional relational sequence may be defined as an ordered list of
Datalog atoms separated by the operator $<$: $l_1 < l_2 < \cdots < l_n$.

Considering a sequence as an ordered succession of events for each dimension,
fluents have been used to indicate that an atom is true for a given event. For
the general case of n-dimensional sequences, the operator $<_i$ has been
introduced to express multi-dimensional relations. Specifically,
$(e_1 <_i e_2)$ denotes that the event $e_2$ is the successor of the event
$e_1$ on the dimension $i$. Hence, a multi-dimensional relational sequence may
be defined as a set of Datalog atoms, concerning n dimensions, where each event
may be related to another event by means of the $<_i$ operators,
$1 \le i \le n$.

In order to represent multi-dimensional relational patterns, the following
dimensional operators have been introduced. Given a set $\mathcal{D}$ of
dimensions, for all $i \in \mathcal{D}$: $<_i$ (next step on dimension $i$)
indicates the direct successor on the dimension $i$; $\triangleleft_i$ (after
some steps on dimension $i$) encodes the transitive closure of $<_i$; and
$\bigcirc_i^n$ (exactly after $n$ steps on dimension $i$) calculates the $n$-th
direct successor.

Hence, a multi-dimensional relational pattern may be defined as a set of
Datalog atoms, regarding n dimensions, in which there are non-dimensional atoms
and each event may be related to another event by means of the operators $<_i$,
$\triangleleft_i$ and $\bigcirc_i^n$, $1 \le i \le n$.

The background knowledge $B$ contains the definitions of the operators
$\bigcirc_i^n$ and $\triangleleft_i$ used to prove the dimensional operators
appearing in the patterns. Given a multi-dimensional relational sequence $S$,
in the following we denote by $\Sigma$ the set of Datalog clauses $B \cup U$,
where $U$ is the set of ground atoms in $S$. In order to calculate the
frequency of a pattern over a sequence, it is important to define the concept
of sequence subsumption.

Definition 1 (Subsumption). Let $\Sigma = B \cup U$, where $U$ is the set of
atoms in a sequence $S$ and $B$ is a background knowledge. A pattern $P$
subsumes the sequence $S$ ($P \preceq S$) iff there exists an
$SLD_{OI}$-deduction of $P$ from $\Sigma$.

An $SLD_{OI}$-deduction is an SLD-deduction under Object Identity [17]. In the
Object Identity framework, within a clause, terms that are denoted with
different symbols must be distinct, i.e. they must represent different objects
of the domain.
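
To make the effect of Object Identity concrete, here is a simplified Python
sketch of a subsumption test between a pattern and the ground atoms of a
sequence. It is only an approximation of Definition 1: it ignores the
background knowledge and the dimensional operators, it is not an SLD resolution
engine, and the names `subsumes_oi` and `is_variable`, the tuple encoding of
atoms and the upper-case convention for variables are assumptions made for this
example.

```python
def is_variable(term):
    """Assumed convention: variables start with an upper-case letter."""
    return isinstance(term, str) and term[:1].isupper()


def subsumes_oi(pattern, facts):
    """True if an injective substitution maps every pattern atom onto a fact."""
    # Under Object Identity a variable may not denote a constant of the pattern.
    constants = {t for atom in pattern for t in atom[1:] if not is_variable(t)}

    def bind(term, value, theta):
        if not is_variable(term):
            return term == value
        if term in theta:
            return theta[term] == value
        if value in theta.values() or value in constants:
            return False          # distinct symbols must denote distinct objects
        theta[term] = value
        return True

    def match(atoms, theta):
        if not atoms:
            return True
        (pred, *args), rest = atoms[0], atoms[1:]
        for fact in facts:
            if fact[0] != pred or len(fact) != len(args) + 1:
                continue
            new_theta = dict(theta)
            if all(bind(t, v, new_theta) for t, v in zip(args, fact[1:])) \
                    and match(rest, new_theta):
                return True
        return False

    return match(list(pattern), {})


facts = {("helix", "e1"), ("strand", "e2"), ("next", "e1", "e2")}
print(subsumes_oi([("helix", "X"), ("next", "X", "Y"), ("strand", "Y")], facts))
print(subsumes_oi([("helix", "X"), ("helix", "Y")], facts))
```

The first pattern is matched with X bound to e1 and Y to e2. The second one is
rejected because the sequence contains a single helix event and, under Object
Identity, X and Y cannot both denote e1; under plain theta-subsumption it would
have been accepted.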

3.2 Feature Construction via pattern mining 

The first step of the Lynx system is a feature construction process, obtained
by mining frequent patterns from sequences. The algorithm for frequent
multi-dimensional relational pattern mining is based on the same idea as the
generic level-wise search method, known in data mining from the Apriori
algorithm [18]. The level-wise algorithm performs a breadth-first search in the
lattice of patterns ordered by a specialisation relation $\preceq$. The search
starts from the most general patterns, and at each level of the lattice the
algorithm generates candidates by using the lattice structure and then
evaluates the frequencies of the candidates. In the generation phase, some
patterns are pruned using the monotonicity of pattern frequency (if a pattern
is not frequent then none of its specialisations is frequent).

The generation of the frequent patterns is based on a top-down approach. The
algorithm starts with the most general patterns. Then, at each step it tries to
specialise all the potentially frequent patterns, discarding the non-frequent
patterns and storing the ones whose length is equal to the user-specified input
parameter maxsize. Furthermore, for each new refined pattern, semantically
equivalent patterns are detected, by using the $\theta_{OI}$-subsumption
relation [17], and discarded. In the specialisation phase, the specialisation
operator under $\theta_{OI}$-subsumption is used. Basically, the operator adds
atoms to the pattern.
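
The level-wise search with frequency-based pruning can be sketched as follows.
This is a deliberately propositional stand-in, assumed for illustration only:
patterns are sets of items rather than relational patterns, specialisation adds
one item, and set inclusion plays the role of subsumption. The function name
`levelwise` is hypothetical; the parameters `minfreq` and `maxsize` mirror the
constraints of the same name, and every level up to `maxsize` is recorded.

```python
def levelwise(transactions, minfreq, maxsize):
    """Breadth-first search for patterns whose frequency is larger than minfreq."""
    items = sorted({i for t in transactions for i in t})

    def freq(pattern):
        return sum(1 for t in transactions if pattern <= t)

    # Level 1: the most general (single-item) patterns that are frequent.
    level = [frozenset([i]) for i in items if freq(frozenset([i])) > minfreq]
    frequent, size = [], 1
    while level and size <= maxsize:
        frequent.extend(level)
        # Specialise only frequent patterns: by monotonicity, no specialisation
        # of an infrequent pattern can be frequent. The set removes duplicates.
        candidates = {p | {i} for p in level for i in items if i not in p}
        level = [c for c in candidates if freq(c) > minfreq]
        size += 1
    return frequent


data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
for pattern in levelwise(data, minfreq=1, maxsize=2):
    print(sorted(pattern))
```

In the relational setting the candidate generation step is replaced by the
refinement operator, and duplicate removal by the detection of equivalent
patterns under $\theta_{OI}$-subsumption.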

The background knowledge. The algorithm uses a background knowledge $B$ (a set
of Datalog clauses) containing the sequence and a set of constraints, similar
to those defined in SeqLog [11], that must be satisfied by the generated
patterns. In particular, some of the constraints included in $B$ are (see [16]
for more details):

- maxsize(M): maximal pattern length;
- minfreq(m): the frequency of the patterns must be larger than m;
- type(p) and mode(p): denote, respectively, the type and the input/output mode
  of the arguments of the predicate p. They are used to specify a language bias
  indicating which predicates can be used in the patterns and to formulate
  constraints on the binding of variables;
- negconstraint($[p_1, p_2, \ldots, p_n]$): specifies a constraint that the
  patterns must not fulfil, i.e. if the clause $(p_1, p_2, \ldots, p_n)$
  subsumes the pattern then the pattern must be discarded;
- posconstraint($[p_1, p_2, \ldots, p_n]$): specifies a constraint that the
  patterns must fulfil. It discards all the patterns that are not subsumed by
  the clause $(p_1, p_2, \ldots, p_n)$;
- atmostone($[p_1, p_2, \ldots, p_n]$): discards all the patterns that make
  true more than one predicate among $p_1, p_2, \ldots, p_n$;
- key($[p_1, p_2, \ldots, p_n]$): optional; specifies that each pattern must
  have one of the predicates $p_1, p_2, \ldots, p_n$ as a starting literal.

Frequency, Support and Confidence. Given a set of relational sequences $D$
defined over a set of classes $C$, the frequency of a pattern $p$,
$\mathrm{freq}(p, D)$, corresponds to the number of sequences $s \in D$ such
that $p$ subsumes $s$. The support of a pattern $p$ with respect to a class
$c \in C$, $\mathrm{supp}_c(p, D)$, corresponds to the number of sequences
$s \in D$ subsumed by $p$ whose class label is $c$. Finally, the confidence of
a pattern $p$ with respect to a class $c \in C$ is defined as
$\mathrm{conf}_c(p, D) = \mathrm{supp}_c(p, D) / \mathrm{freq}(p, D)$.
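
The three measures can be computed directly from their definitions. In the
sketch below the dataset is a list of (sequence, label) pairs and the
subsumption test is passed in as a function; for the toy data, set inclusion is
used as a stand-in for relational subsumption. All names (`freq`, `supp`,
`confidence`, `subset`) are chosen for this example and are not part of Lynx.

```python
def freq(p, D, subsumes):
    """Number of sequences in D subsumed by the pattern p."""
    return sum(1 for s, _ in D if subsumes(p, s))


def supp(p, c, D, subsumes):
    """Number of sequences in D with class label c that are subsumed by p."""
    return sum(1 for s, y in D if y == c and subsumes(p, s))


def confidence(p, c, D, subsumes):
    """supp_c(p, D) / freq(p, D); defined here as 0.0 when p matches nothing."""
    f = freq(p, D, subsumes)
    return supp(p, c, D, subsumes) / f if f else 0.0


def subset(p, s):
    """Toy stand-in for subsumption: the pattern is a subset of the sequence."""
    return p <= s


D = [({"a", "b"}, "c1"), ({"a"}, "c1"), ({"a", "c"}, "c2"), ({"b"}, "c2")]
print(freq({"a"}, D, subset), supp({"a"}, "c1", D, subset))
print(confidence({"a", "b"}, "c1", D, subset))
```

A pattern whose confidence for some class equals 1, such as {a, b} for class c1
in the toy data, is matched only by sequences of that class.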

The refinement step. The refinement of patterns is obtained by using a
refinement operator $\rho$ that maps each pattern to a set of specialisations
of the pattern, i.e. $\rho(p) \subseteq \{p' \mid p \preceq p'\}$, where
$p \preceq p'$ means that $p$ is more general than $p'$ or that $p$ subsumes
$p'$. In particular, given the set $\mathcal{D}$ of dimensions, the set
$\mathcal{F}$ of fluent atoms and the set $\mathcal{P}$ of non-fluent atoms,
for each $i \in \mathcal{D}$ the refinement operator for specialising the
patterns is defined as follows:

adding a non-dimensional atom

- the pattern $S$ is specialised by adding a non-dimensional atom;

adding a dimensional atom

- the pattern $S$ is specialised by adding the dimensional atom $(x <_i y)$;
- the pattern $S$ is specialised by adding the dimensional atom
  $(x \triangleleft_i y)$;
- the pattern $S$ is specialised by adding the dimensional atom
  $(x \bigcirc_i^n y)$.

The dimensional atoms are added if and only if there exists a fluent atom
referring to its starting event. The length of a pattern $P$ is equal to the
number of non-dimensional atoms in $P$.

For each specialisation level, before starting the next refinement step, Lynx
records all the obtained patterns. Hence, the final set may contain a pattern
$p$ that subsumes many other patterns in the same set. However, the subsumed
patterns may have a different support, contributing in a different way to the
classification model.

3.3 Pattern-based Classification 

After having identified the set of frequent patterns, the task is now how to
use them as features in order to correctly classify unseen sequences. Let
$\mathcal{X}$ be the input space of relational sequences, and let
$\mathcal{Y} = \{1, 2, \ldots, Q\}$ denote the finite set of possible class
labels. Given a training set $D = \{(X_i, Y_i) \mid 1 \le i \le m\}$, where
$X_i \in \mathcal{X}$ is a single relational sequence and
$Y_i \in \mathcal{Y}$ is the label associated to $X_i$, the goal is to learn a
function $h : \mathcal{X} \to \mathcal{Y}$ from $D$ that predicts the label for
each unseen instance.

Let $\mathcal{P}$, with $|\mathcal{P}| = d$, be the set of constructed features
obtained in the first step of the Lynx system (the patterns mined from $D$).
For each sequence $X_k \in \mathcal{X}$ we can build a $d$-component
vector-valued random variable $\mathbf{x} = (x_1, x_2, \ldots, x_d)$, where each
$x_i \in \mathbf{x}$ is 1 if the pattern $p_i \in \mathcal{P}$ subsumes the
sequence $X_k$, and 0 otherwise.

Using Bayes' theorem, if $p(Y_j)$ describes the prior probability of the class
$Y_j$, then the posterior probability $p(Y_j \mid \mathbf{x})$ can be computed
from $p(\mathbf{x} \mid Y_j)$ by

$$p(Y_j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid Y_j)\, p(Y_j)}{\sum_{i=1}^{Q} p(\mathbf{x} \mid Y_i)\, p(Y_i)} \qquad (1)$$

Given a set of discriminant functions $g_i(\mathbf{x})$, $i = 1, \ldots, Q$, a
classifier is said to assign the vector $\mathbf{x}$ to the class $Y_j$ if
$g_j(\mathbf{x}) > g_i(\mathbf{x})$ for all $i \neq j$. Taking
$g_i(\mathbf{x}) = p(Y_i \mid \mathbf{x})$, the maximum discriminant function
corresponds to the maximum a posteriori (MAP) probability. For minimum error
rate classification, the following discriminant function will be used

$$g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid Y_i) + \ln p(Y_i) \qquad (2)$$

Here, we are considering a multi-class classification problem involving
discrete features, in which the components of the vector $\mathbf{x}$ are
binary-valued and conditionally independent. In particular, let the components
of the vector $\mathbf{x} = (x_1, \ldots, x_d)$ be binary valued (0 or 1). We
define

$$p_{ij} = \mathrm{Prob}(x_i = 1 \mid Y_j), \qquad i = 1, \ldots, d, \quad j = 1, \ldots, Q$$

with the components of $\mathbf{x}$ being statistically independent for all
$x_i \in \mathbf{x}$. In this model each feature $x_i$ gives us a yes/no answer
about the pattern $p_i$. Moreover, if $p_{ik} > p_{ij}$ we expect the $i$-th
pattern to subsume a sequence more frequently when its class is $Y_k$ than when
it is $Y_j$. The factors $p_{ij}$ can be estimated from the training examples
as frequency counts, as follows

$$p_{ij} = \mathrm{Prob}(x_i = 1 \mid Y_j) = \mathrm{support}_{Y_j}(p_i), \qquad i = 1, \ldots, d, \quad j = 1, \ldots, Q$$

In this way, the constructed features $p_i$ may be viewed as probabilistic
features expressing the relevance of the pattern $p_i$ in determining the
classification $Y_j$.

By assuming conditional independence we can write $p(\mathbf{x} \mid Y_j)$ as
a product of the probabilities of the components of $\mathbf{x}$. Given this
assumption, a particularly convenient way of writing the class-conditional
probabilities is as follows:

$$p(\mathbf{x} \mid Y_j) = \prod_{i=1}^{d} (p_{ij})^{x_i} (1 - p_{ij})^{1 - x_i} \qquad (3)$$

Hence, Equation 2 yields the discriminant function

$$\begin{aligned}
g_j(\mathbf{x}) &= \ln p(\mathbf{x} \mid Y_j) + \ln p(Y_j) \\
&= \ln \prod_{i=1}^{d} (p_{ij})^{x_i} (1 - p_{ij})^{1 - x_i} + \ln p(Y_j) \\
&= \sum_{i=1}^{d} \ln\left((p_{ij})^{x_i} (1 - p_{ij})^{1 - x_i}\right) + \ln p(Y_j) \\
&= \sum_{i=1}^{d} x_i \ln \frac{p_{ij}}{1 - p_{ij}} + \sum_{i=1}^{d} \ln(1 - p_{ij}) + \ln p(Y_j)
\end{aligned} \qquad (4)$$

The factor corresponding to the prior probability for the class $Y_j$ can be
estimated from the training set as

$$p(Y_j) = \frac{|\{(X, Y) \in D \text{ s.t. } Y = Y_j\}|}{|D|}, \qquad j = 1, \ldots, Q$$

The minimum probability of error is achieved by the following decision rule:
decide $Y_k$ if $g_k(\mathbf{x}) > g_j(\mathbf{x})$ for all $j \neq k$, where
$g_i(\cdot)$ is defined as in Equation 4. Note that this discriminant function
is linear in the $x_i$, and thus we can write

$$g_j(\mathbf{x}) = \sum_{i=1}^{d} \alpha_i x_i + \beta_0 \qquad (5)$$

where $\alpha_i = \ln(p_{ij} / (1 - p_{ij}))$ and
$\beta_0 = \sum_{i=1}^{d} \ln(1 - p_{ij}) + \ln p(Y_j)$. Recall that we decide
$Y_i$ if $g_i(\mathbf{x}) > g_k(\mathbf{x})$ for all $k \neq i$. The magnitude
of the weight $\alpha_i$ in $g_j(\mathbf{x})$ indicates the relevance of a
subsumption for the pattern $p_i$ in determining the classification $Y_j$. This
is the probabilistic characteristic of the features obtained in the feature
construction phase, as opposed to Boolean features.
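
The linear discriminant of Equation 4 is straightforward to evaluate once the
factors $p_{ij}$ and the priors have been estimated. The following Python
sketch shows one possible way to do so on binary feature vectors. Two choices
are assumptions of this example and are not stated in the text: $p_{ij}$ is
estimated as the fraction of sequences of class $Y_j$ whose $i$-th feature is
1, and Laplace smoothing (the `smoothing` parameter) keeps every estimate
strictly between 0 and 1 so that the logarithms are defined. The function names
are hypothetical.

```python
import math


def fit(X, y, smoothing=1.0):
    """Estimate the class priors p(Y_j) and the factors p_ij from binary vectors."""
    prior, p = {}, {}
    d = len(X[0])
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        prior[c] = len(rows) / len(X)
        # Smoothed relative frequency of x_i = 1 within class c (assumption).
        p[c] = [(sum(r[i] for r in rows) + smoothing) / (len(rows) + 2 * smoothing)
                for i in range(d)]
    return prior, p


def discriminant(x, prior_c, p_c):
    """g_j(x) = sum_i x_i ln(p_ij / (1 - p_ij)) + sum_i ln(1 - p_ij) + ln p(Y_j)."""
    return sum(xi * math.log(pi / (1 - pi)) + math.log(1 - pi)
               for xi, pi in zip(x, p_c)) + math.log(prior_c)


def predict(x, prior, p):
    """Decide the class with the largest discriminant value."""
    return max(prior, key=lambda c: discriminant(x, prior[c], p[c]))


X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = ["a", "a", "b", "b"]
prior, p = fit(X, y)
print(predict([1, 0], prior, p), predict([0, 1], prior, p))
```

In the toy data only the first feature is informative: it is 1 for every
sequence of class a and 0 for every sequence of class b, so it alone drives the
decision, while the second feature receives a zero weight.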



3.4 Feature Selection with stochastic local search 

After having constructed a set of features, and presented a method to use
those features to classify unseen sequences, the problem is now how to find an
optimal subset of these features that optimises the prediction accuracy. The
optimisation problem of selecting a subset of features (patterns) with a
superior classification performance may be formulated as follows. Let
$\mathcal{P}$ be the original set of constructed patterns, and let
$f : 2^{|\mathcal{P}|} \to \mathbb{R}$ be a function scoring a selected subset
$X \subseteq \mathcal{P}$. The problem of feature selection is to find a subset
$\hat{X} \subseteq \mathcal{P}$ such that

$$f(\hat{X}) = \max_{Z \subseteq \mathcal{P}} f(Z).$$

An exhaustive approach to this problem would require examining all
$2^{|\mathcal{P}|}$ possible subsets of the feature set $\mathcal{P}$, making it
impractical even for low values of $|\mathcal{P}|$. The use of a stochastic
local search procedure allows us to obtain good solutions without having to
explore the whole solution space.

Given a subset $P \subseteq \mathcal{P}$, for each sequence
$X_j \in \mathcal{X}$ we let the classifier find the MAP hypothesis adopting
the discriminant function reported in Eq. 2:

$$\hat{h}_P(X_j) = \arg\max_i g_i(\mathbf{x}_j), \qquad (6)$$

where $\mathbf{x}_j$ is the feature-based representation of the sequence $X_j$
obtained using the patterns $P$. Hence the initial optimisation problem
corresponds to minimising the expectation

$$E\left[\mathbf{1}_{\hat{h}_P(X_i) \neq Y_i}\right]$$

where $\mathbf{1}_{\hat{h}_P(X_i) \neq Y_i}$ is the characteristic function of
the training example $X_i$, defined as

$$\mathbf{1}_{\hat{h}_P(X_i) \neq Y_i} = \begin{cases} 1 & \text{if } \hat{h}_P(X_i) \neq Y_i \\ 0 & \text{otherwise} \end{cases}$$

Finally, given the training set $D$ with $|D| = m$ and a set of features
(patterns) $P$, the number of classification errors made by the Bayesian model
is

$$err_D(P) = m\, E\left[\mathbf{1}_{\hat{h}_P(X_i) \neq Y_i}\right]. \qquad (7)$$

GRASP$^{FS}$. Consider a combinatorial optimisation problem, where one is given
a discrete set $X$ of solutions and an objective function
$f : X \to \mathbb{R}$ to be minimised, and seeks a solution $x^* \in X$ such
that $\forall x \in X : f(x^*) \le f(x)$. A method to find high-quality
solutions for a combinatorial problem is a two-step approach consisting of a
greedy construction phase followed by a perturbative local search [19]. The
greedy construction method starts from an empty candidate solution and at each
construction step adds the best ranked component according to a heuristic
selection function. Then, a perturbative local search algorithm, searching a
local neighbourhood, is used to improve the candidate solution thus obtained.
Advantages of this search method are the much better solution quality and the
fewer perturbative improvement steps needed to reach the local optimum.

Greedy Randomised Adaptive Search Procedures (GRASP) [20] solve the problem of
the limited number of different candidate solutions generated by a greedy
construction search method by randomising the construction method. GRASP is an
iterative process combining at each iteration a construction and a local search
phase. In the construction phase a feasible solution is built, and then its
neighbourhood is explored by the local search.

Algorithm 1 reports the GRASP$^{FS}$ procedure included in the Lynx system to
perform the feature selection task. In each iteration, it computes a solution
$S \in \mathcal{S}$ by using a randomised constructive search procedure and
then applies a local search procedure to $S$, yielding an improved solution.
The main procedure is made up of two components: a constructive phase and a
local search phase.

Algorithm 1 GRASP$^{FS}$

Input: $D$: the training set; $\mathcal{P}$: a set of patterns (features);
  maxiter: maximum number of iterations; $err_D(P)$: the evaluation function
  (see Eq. 7)
Output: solution $\hat{S} \subseteq \mathcal{P}$

  $\hat{S} = \emptyset$, $err_D(\hat{S}) = +\infty$
  iter = 0
  while iter < maxiter do
    $\alpha$ = rand(0,1)
    /* construction */
    $S = \emptyset$; $i = 0$
    while $i < n$ do
      $\mathcal{S} = \{S' \mid S' = add(S, A)\}$
      $\overline{s} = \max\{err_D(T) \mid T \in \mathcal{S}\}$
      $\underline{s} = \min\{err_D(T) \mid T \in \mathcal{S}\}$
      RCL $= \{S' \in \mathcal{S} \mid err_D(S') \le \overline{s} + \alpha(\underline{s} - \overline{s})\}$
      select the new $S$, at random, from RCL
      $i \leftarrow i + 1$
    /* local search */
    $\mathcal{N} = \{S' \in neigh(S) \mid err_D(S') < err_D(S)\}$
    while $\mathcal{N} \neq \emptyset$ do
      select $S \in \mathcal{N}$
      $\mathcal{N} \leftarrow \{S' \in neigh(S) \mid err_D(S') < err_D(S)\}$
    if $err_D(S) < err_D(\hat{S})$ then
      $\hat{S} = S$
    iter = iter + 1
  return $\hat{S}$

The constructive search algorithm used in GRASP$^{FS}$ iteratively adds a
solution component by randomly selecting it, according to a uniform
distribution, from a set, named restricted candidate list (RCL), of highly
ranked solution components with respect to a greedy function
$g : \mathcal{S} \to \mathbb{R}$. The probabilistic component of GRASP$^{FS}$
is characterised by randomly choosing one of the best candidates in the RCL. In
our case the greedy function $g$ corresponds to the error function $err_D(P)$
previously reported in Eq. 7. In particular, given the heuristic function
$err_D(P)$ and the set $\mathcal{S}$ of feasible solutions,

$$\underline{s} = \min\{err_D(S) \mid S \in \mathcal{S}\}$$

and

$$\overline{s} = \max\{err_D(S) \mid S \in \mathcal{S}\}$$

are computed. Then the RCL is defined by including in it all the components $S$
such that

$$err_D(S) \le \overline{s} + \alpha(\underline{s} - \overline{s}).$$

The parameter $\alpha$ controls the amounts of greediness and randomness. A
value $\alpha = 1$ corresponds to a greedy construction procedure, while
$\alpha = 0$ produces a random construction. As reported in [21], GRASP with a
fixed nonzero RCL parameter $\alpha$ is not asymptotically convergent to a
global optimum. A solution that makes the algorithm asymptotically globally
convergent could be to randomly select the parameter value from the continuous
interval [0, 1] at the beginning of each iteration and to use this value during
the entire iteration, as we implemented in GRASP$^{FS}$. Hence, starting from
the empty set, in the first iteration all the subsets containing exactly one
pattern are considered and the best is selected for further specialisation. At
the iteration $i$, the working set of patterns $S$ is refined by trying to add
a pattern belonging to $\mathcal{P} \setminus S$.

To improve the solution generated by the construction phase, a local search is
used. It works by iteratively replacing the current solution with a better
solution taken from the neighbourhood of the current solution, as long as there
is a better solution in the neighbourhood. In order to build the neighbourhood
of a solution $S$, $neigh(S)$, the following operators have been used. Given
the set of patterns $\mathcal{P}$ and a solution
$S = \{p_1, p_2, \ldots, p_t\} \subseteq \mathcal{P}$:

add: $S \to S \cup \{p_i\}$ where $p_i \in \mathcal{P} \setminus S$;

remove: $S \to S \setminus \{p_i\} \cup \{p_k\}$ where $p_i \in S$ and
$p_k \in \mathcal{P} \setminus S$.

In particular, given a solution $S \in \mathcal{S}$, the elements of the
neighbourhood $neigh(S)$ of $S$ are those solutions that can be obtained by
applying an elementary modification (add or remove) to $S$. Local search starts
from an initial solution $S^0 \in \mathcal{S}$ and iteratively generates a
series of improving solutions $S^1, S^2, \ldots$. At the $k$-th iteration,
$neigh(S^k)$ is searched for an improving solution $S^{k+1}$ such that
$err_D(S^{k+1}) < err_D(S^k)$. If such a solution is found, it is made the
current solution. Otherwise, the search ends with $S^k$ as a local optimum.
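
The overall procedure can be sketched in Python as follows. This is a minimal
illustration under stated assumptions rather than the Lynx implementation: the
error function `err` is any callable on a set of features (in Lynx it would be
the number of training errors of the naive Bayes model), `n` is the number of
construction steps, the restricted candidate list keeps the candidates whose
error is at most $\overline{s} + \alpha(\underline{s} - \overline{s})$ so that
$\alpha = 1$ is greedy and $\alpha = 0$ is random, and the local search takes
the first improving neighbour. The names `grasp_fs` and `neighbours` are
hypothetical.

```python
import random


def neighbours(S, features):
    """Neighbourhood of S: add one pattern, or swap one pattern for another."""
    rest = features - S
    for p in rest:
        yield S | {p}                       # add operator
    for p in S:
        for q in rest:
            yield (S - {p}) | {q}           # "remove" operator: replace p by q


def grasp_fs(features, err, n, maxiter, seed=0):
    rng = random.Random(seed)
    features = frozenset(features)
    best, best_err = frozenset(), float("inf")
    for _ in range(maxiter):
        alpha = rng.random()                # new alpha at every iteration
        # Construction phase: grow S one pattern at a time through the RCL.
        S = frozenset()
        for _ in range(min(n, len(features))):
            candidates = sorted((S | {p} for p in features - S), key=sorted)
            errors = [err(c) for c in candidates]
            s_min, s_max = min(errors), max(errors)
            # max(...) guards against rounding so that the RCL is never empty.
            threshold = max(s_min, s_max + alpha * (s_min - s_max))
            rcl = [c for c, e in zip(candidates, errors) if e <= threshold]
            S = rng.choice(rcl)
        # Local search phase: move to an improving neighbour while one exists.
        improved = True
        while improved:
            improved = False
            for T in neighbours(S, features):
                if err(T) < err(S):
                    S, improved = T, True
                    break
        if err(S) < best_err:
            best, best_err = S, err(S)
    return best, best_err


# Toy error function: number of mismatches with a hidden "ideal" subset.
target = {"a", "c"}
best, best_err = grasp_fs({"a", "b", "c", "d"}, lambda S: len(S ^ target),
                          n=2, maxiter=5)
print(sorted(best), best_err)
```

In a wrapper setting every call to `err` retrains and evaluates the classifier
on the candidate subset, so caching the error of already visited subsets is a
natural optimisation that this sketch omits.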

4 Experiments 

Experiments have been conducted on protein fold classification, an important
problem in biology since the functions of proteins depend on how they fold up.
The dataset, already used in [13, 12, 14], is made up of logical sequences of
the secondary structure of protein domains. The task is to predict one of the
five most populated SCOP folds of alpha and beta proteins (a/b): TIM
beta/alpha-barrel (c1), NAD(P)-binding Rossmann-fold domains (c2), Ribosomal
protein L4 (c23), Cysteine hydrolase (c37), and Phosphotyrosine protein
phosphatases I-like (c55). The class of a/b proteins consists of proteins with
mainly parallel beta sheets (beta-alpha-beta units). Overall, the class
distribution is 721 sequences for the class c1, 360 for c2, 274 for c23, 441
for c37 and 290 for c55.

Conf.  Lynx              Folds                                                        Mean
                         1     2     3     4     5     6     7     8     9     10
0.95   w/o GRASP$^{FS}$  0.84  0.88  0.83  0.83  0.85  0.76  0.85  0.81  0.82  0.80   0.826
0.95   w GRASP$^{FS}$    0.88  0.92  0.88  0.88  0.89  0.81  0.93  0.87  0.90  0.9.3  0.878
1.0    w/o GRASP$^{FS}$  0.89  0.94  0.84  0.92  0.94  0.88  0.91  0.89  0.88  0.87   0.896
1.0    w GRASP$^{FS}$    0.94  0.97  0.93  0.95  0.95  0.93  0.93  0.97  0.90  0.94   0.942

Table 1. Cross-validated accuracy of Lynx with and without feature selection
for two values of the confidence.



As in [14], we used a round robin approach [22], treating each pair of classes
as a separate classification problem; the overall classification of an example
instance is the majority vote among all pairwise classification problems.
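
The round robin scheme can be illustrated with a few lines of Python. The
sketch assumes that one binary classifier has already been trained for each
pair of classes and is available as a callable returning one of the two labels;
the dictionary `pairwise` and the function name `round_robin_predict` are
hypothetical, and ties are broken in favour of the label that received its
first vote earliest, a choice the text does not specify.

```python
from collections import Counter
from itertools import combinations


def round_robin_predict(x, classes, pairwise):
    """Majority vote over the predictions of all pairwise classifiers."""
    votes = Counter(pairwise[(a, b)](x)
                    for a, b in combinations(sorted(classes), 2))
    return votes.most_common(1)[0][0]


# Toy pairwise classifiers that ignore their input (illustration only).
pairwise = {
    ("c1", "c2"): lambda x: "c1",
    ("c1", "c23"): lambda x: "c1",
    ("c2", "c23"): lambda x: "c23",
}
print(round_robin_predict(None, ["c1", "c2", "c23"], pairwise))
```

With Q classes the scheme trains Q(Q-1)/2 binary classifiers, i.e. 10 for the
five folds considered here.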

Table 1 reports the 10-fold cross-validated accuracy of Lynx. Two experiments
have been conducted, one with a confidence level equal to 0.95 and the other
with a confidence level of 1.0. In particular, given the training data $D$, we
imposed that $\mathrm{conf}_c(p, D) = 0.95$ (resp. $\mathrm{conf}_c(p, D) = 1$).
For each experiment, Lynx has been applied on the same data with and without
feature selection. In particular, we classified the test instances without
applying GRASP$^{FS}$ in order to have a baseline accuracy value. As we can
see, the accuracy grows when GRASP$^{FS}$ optimises the feature set, supporting
the validity of the method adopted for the feature selection task. Furthermore,
the accuracy grows when we mine patterns with a confidence level equal to 1.0,
which corresponds to keeping jumping emerging patterns† only. This indicates
that jumping emerging patterns have a greater discriminative power than
emerging patterns‡ (when the confidence level is equal to 0.95).

As a second experiment we compared Lynx on the same data to other statistical
relational learning systems, whose cross-validated accuracies are summarised in
Table 2. In particular, LoHMMs [12] achieved a predictive accuracy of 75%,
Fisher kernels [13] achieved an accuracy of about 84%, TildeCRF [14] reached an
accuracy of 92.96%, while Lynx obtained an accuracy of 94.15%. Hence, on this
real-world dataset Lynx performs better than the established methods
considered.

System               Accuracy
LoHMMs [12]          75%
Fisher kernels [13]  84%
TildeCRF [14]        92.96%
Lynx                 94.15%

Table 2. Cross-validated accuracy of LoHMMs, Fisher kernels, TildeCRF and Lynx.



† A jumping emerging pattern is a pattern with non-zero support on one class
  and zero support on all the other classes, i.e. with a confidence equal to 1.
‡ An emerging pattern is a pattern with a growth rate greater than 1.

5 Conclusions

In this paper we considered the problem of multi-class relational sequence
learning using relevant patterns discovered from a set of labelled sequences.
We first applied a feature construction method in order to map each relational
sequence into a feature vector. Then, a feature selection algorithm has been
applied to find an optimal subset of the constructed features leading to high
classification accuracy. The feature selection task has been solved adopting a
wrapper approach that uses a stochastic local search algorithm embedding a
naive Bayes classifier. The performance of the proposed method applied to a
real-world dataset shows an improvement when compared to other established
methods.